CSM-1B Architecture

CSM, short for Conversational Speech Model, makes a simple architectural bet: long-range dialogue belongs in one model path, while local acoustic detail belongs in another. Sesame describes CSM as a speech generator built from two LLaMA-style autoregressive transformers: a backbone that predicts the first RVQ (residual vector quantization) codebook token for each audio frame, and a smaller depth decoder that fills in the remaining codebooks inside that frame (Sesame architecture post, Sesame CSM repo). That split matters because speech contains two different workloads. The model needs language-level memory across turns, and it also needs dense acoustic detail within every frame.

The Bottleneck

Autoregressive speech models pay for fidelity with sequence length. If an audio codec emits N codebook tokens per frame and the model flattens them into one stream, attention has to process N tokens for every frame, so sequence length scales with frames times codebooks rather than with elapsed time alone. Sesame's CSM avoids that shape by making the backbone operate at the frame level while the depth decoder handles the within-frame codebooks (Sesame architecture post).
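To make the scaling argument concrete, here is a minimal arithmetic sketch. The 12.5 Hz frame rate is the Mimi figure Sesame cites; the codebook count of 32 and the two-minute clip are illustrative assumptions, and text tokens in the context are ignored.

```python
def sequence_lengths(seconds: float, num_codebooks: int, frame_rate_hz: float = 12.5) -> dict:
    """Compare token counts for a flattened stream and a frame-level backbone."""
    frames = int(seconds * frame_rate_hz)
    return {
        "frames": frames,
        # Flattened design: every codebook token occupies its own sequence position.
        "flattened_tokens": frames * num_codebooks,
        # Split design: the backbone advances one position per frame.
        "backbone_positions": frames,
        # The depth decoder fills in the codebooks after the first, inside each frame.
        "decoder_steps_per_frame": num_codebooks - 1,
    }


# Illustrative values only: two minutes of audio, 32 codebooks (an assumption).
stats = sequence_lengths(seconds=120, num_codebooks=32)
print(stats)
```

Under these assumptions the backbone sees 1,500 positions where a flattened stream would hold 48,000, which is the gap the split is designed to remove.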

The design spends the large model on dialogue state and the small model on local sound detail. I like this architecture because it puts compute where the uncertainty lives. Conversation state changes across turns, so it gets the larger model and the longer context. The later acoustic codebooks only refine a frame once the semantic anchor exists, so they get the smaller one.

Mimi as the Contract

CSM uses Kyutai's Mimi audio codec to turn waveform audio into discrete tokens. Sesame describes Mimi in this system as producing one semantic codebook plus N-1 acoustic codebooks at 12.5 Hz, which means one frame every 80 ms (Sesame architecture post). Kyutai's Mimi model card describes the codec as combining semantic and acoustic information at 12.5 Hz and 1.1 kbps, while the Moshi paper explains the split-RVQ design that separates a semantic token stream from acoustic residual codebooks (Mimi model card, Moshi paper).
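The timing figures above follow from simple arithmetic, sketched here. The two constants come from the sources quoted in the paragraph; the helper name `frames_needed` is my own.

```python
import math

FRAME_RATE_HZ = 12.5   # frame rate quoted by Sesame and the Mimi model card
BITRATE_BPS = 1100     # 1.1 kbps, from the Mimi model card

frame_ms = 1000 / FRAME_RATE_HZ               # duration of one codec frame in ms
bits_per_frame = BITRATE_BPS / FRAME_RATE_HZ  # bit budget per frame at that bitrate


def frames_needed(seconds: float) -> int:
    """Frames required to cover a clip, rounding up the final partial frame."""
    return math.ceil(seconds * FRAME_RATE_HZ)


print(frame_ms, bits_per_frame, frames_needed(10))
```

A budget of 88 bits per frame would match 8 codebooks at 11 bits each (2,048 entries per codebook). That breakdown is my reading of Mimi's default configuration and should be checked against the Moshi paper; the N in Sesame's description is a separate number and may differ.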

The first codebook carries the semantic anchor. The remaining codebooks carry acoustic residue: timbre, texture, and frame-level detail. Residual vector quantization gives the architecture a useful boundary. The backbone predicts the first codebook for the next frame. The depth decoder conditions on that anchor and generates the acoustic tail.
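For readers new to residual vector quantization, a toy sketch of the generic technique may help. This is plain RVQ over random codebooks in pure Python, not Mimi's split-RVQ variant and not its learned codebooks; every name and size here is illustrative.

```python
import random


def sq_error(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec."""
    return min(range(len(codebook)), key=lambda i: sq_error(codebook[i], vec))


def rvq_encode(vec, codebooks):
    """Quantize stage by stage; each stage encodes the residual left by the previous one."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


rng = random.Random(0)


def make_codebook(scale, size=16, dim=4):
    """Random toy codebook; a real codec learns these entries."""
    entries = [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(size - 1)]
    entries.append([0.0] * dim)  # zero entry: a stage can never increase the error
    return entries


# Later stages use a smaller scale, mimicking coarse-to-fine refinement.
codebooks = [make_codebook(s) for s in (1.0, 0.5, 0.25)]
frame = [rng.uniform(-1, 1) for _ in range(4)]

codes = rvq_encode(frame, codebooks)
coarse = rvq_decode(codes[:1], codebooks[:1])  # first codebook only
full = rvq_decode(codes, codebooks)            # all stages
print(codes, sq_error(frame, coarse), sq_error(frame, full))
```

The first index alone gives a coarse reconstruction, and each later index only narrows the remaining error. That ordering is what lets a model predict the first codebook separately and treat the rest as refinement.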

Backbone and Depth Decoder

Sesame presents its smallest CSM configuration, "Tiny", as a roughly 1B-parameter backbone paired with a roughly 100M-parameter decoder, both LLaMA-style autoregressive transformers (Sesame architecture post). The released csm-1b checkpoint is distributed as an open speech-generation model on Hugging Face, with sample code that tokenizes conversational turns and generates audio rather than text (csm-1b model card, Sesame CSM repo). Public metadata does not give a fully consistent total parameter count, so the source-backed claim to keep is the backbone/depth-decoder split.

| Part | Job | Token unit | Why it exists |
| --- | --- | --- | --- |
| Backbone | Predict the first RVQ codebook for the next frame | One audio frame | Keeps dialogue context short enough for long conversation |
| Depth decoder | Predict the remaining codebooks for that frame | Codebooks inside one frame | Adds acoustic fidelity without making the backbone sequence explode |
| Mimi decoder | Convert generated codebooks back to waveform audio | Codec frames | Turns tokens back into speech |

The public docs do not support every low-level implementation claim people repeat about CSM. I would not state the exact input tensor shape or the embedding aggregation path without reading the released code and citing the specific implementation. The reliable source-backed mechanism is the two-stage autoregressive split.

Inference Loop

Generation runs as a nested loop. First, the backbone reads the prior conversation context and predicts the first codebook for the next frame. Then the depth decoder generates the rest of the frame's codebooks. The completed frame goes back into the conversation context, and Mimi reconstructs waveform audio from the codebook sequence (Sesame architecture post, Sesame CSM repo).
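The nested loop can be sketched with stand-in callables. Nothing here is Sesame's API: `generate_frames`, the stub models, and the way the depth decoder receives its inputs are all assumptions made for illustration, and Mimi decoding and text tokens are left out.

```python
from typing import Callable, List

Frame = List[int]  # one token per codebook, first codebook at index 0


def generate_frames(
    context: List[Frame],
    backbone: Callable[[List[Frame]], int],
    depth_decoder: Callable[[List[Frame], List[int]], int],
    num_codebooks: int,
    num_frames: int,
) -> List[Frame]:
    """Two-stage autoregressive loop: one backbone step, then the in-frame decoder steps."""
    history = list(context)
    generated: List[Frame] = []
    for _ in range(num_frames):              # outer loop: one step per audio frame
        frame = [backbone(history)]          # backbone predicts the first codebook
        while len(frame) < num_codebooks:    # inner loop: remaining codebooks
            frame.append(depth_decoder(history, frame))
        history.append(frame)                # completed frame feeds back into context
        generated.append(frame)
    return generated


# Deterministic stubs standing in for the real models (hypothetical behaviour).
def stub_backbone(history: List[Frame]) -> int:
    return len(history) % 1024


def stub_depth_decoder(history: List[Frame], partial: List[int]) -> int:
    return (partial[-1] + 1) % 1024


frames = generate_frames([], stub_backbone, stub_depth_decoder, num_codebooks=4, num_frames=3)
print(frames)
```

In a real system the generated frames would then go to the Mimi decoder for waveform reconstruction. The sketch stops at tokens because that is where the two-stage structure ends.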

This gives CSM two autoregressive clocks. The slow clock advances one 80 ms frame at a time. The fast clock runs through codebooks inside that frame. The benefit is sequence compression for the large backbone. The cost is serving complexity, because production inference has to coordinate the backbone, the depth decoder, tokenizer state, waveform reconstruction, and streaming audio output.

Training Amortization

The depth decoder creates the main training pressure. If Sesame trained every acoustic codebook for every frame in every long sequence, memory use would rise with the inner codebook loop. Sesame's solution is amortization: train the first codebook on every frame, but train the depth decoder on a random 1/16 of frames (Sesame architecture post).

That choice tells you how the authors view the problem. The backbone must learn every step of conversation flow. The acoustic decoder can learn from a sampled subset because local codebook prediction repeats the same kind of task frame after frame. The trick preserves the architectural split without forcing the training run to pay the full inner-loop cost at every position.
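A small sketch shows what the 1/16 sampling does to the number of supervised predictions. Sesame's post gives the fraction but not the sampling procedure, so uniform sampling without replacement per sequence is my assumption, as are the function names, the 1,600-frame sequence, and the 32 codebooks.

```python
import random


def sample_decoder_frames(num_frames: int, fraction: float = 1 / 16, rng=None) -> list:
    """Pick the frames whose later codebooks contribute to the depth-decoder loss."""
    rng = rng or random.Random()
    k = min(num_frames, max(1, round(num_frames * fraction)))
    return sorted(rng.sample(range(num_frames), k))


def loss_terms(num_frames: int, num_codebooks: int, decoder_frames: list) -> tuple:
    """Count supervised predictions for the backbone and for the depth decoder."""
    backbone_terms = num_frames                                 # first codebook, every frame
    decoder_terms = len(decoder_frames) * (num_codebooks - 1)   # sampled frames only
    return backbone_terms, decoder_terms


rng = random.Random(0)
num_frames, num_codebooks = 1600, 32   # illustrative sizes, not Sesame's settings

picked = sample_decoder_frames(num_frames, rng=rng)
backbone_terms, decoder_terms = loss_terms(num_frames, num_codebooks, picked)
full_decoder_terms = num_frames * (num_codebooks - 1)   # cost without amortization

print(len(picked), backbone_terms, decoder_terms, full_decoder_terms)
```

With these numbers the decoder is supervised on 3,100 predictions where it would otherwise see 49,600, while the backbone still gets a target on all 1,600 frames.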

Deployment Boundary

The open CSM release is a speech generator, not a complete voice assistant. Sesame's repository says the model cannot generate text and recommends pairing it with a separate language model when you want a general conversational assistant (Sesame CSM repo). That boundary matters. CSM can make speech from conversational context, but product behavior still needs dialogue management, policy handling, retrieval, tool use, and text reasoning somewhere else.
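The boundary is easy to see in a pipeline sketch. Both models are passed in as plain callables, and every name here (`Turn`, `respond`, the stubs) is hypothetical, not part of Sesame's repository.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    speaker: str
    text: str
    audio: bytes = b""  # empty until speech has been synthesized


def respond(
    history: List[Turn],
    user_text: str,
    text_model: Callable[[List[Turn]], str],
    speech_model: Callable[[List[Turn], str], bytes],
) -> List[Turn]:
    """One assistant turn: a text model decides what to say, a speech model renders it."""
    history = history + [Turn("user", user_text)]
    reply_text = text_model(history)                  # reasoning happens outside the speech model
    reply_audio = speech_model(history, reply_text)   # speech generator conditions on context
    return history + [Turn("assistant", reply_text, reply_audio)]


# Stubs so the sketch runs; real systems would call an LLM and a speech generator here.
def text_stub(history: List[Turn]) -> str:
    return f"echo: {history[-1].text}"


def speech_stub(history: List[Turn], text: str) -> bytes:
    return text.encode("utf-8")


dialogue = respond([], "hello", text_stub, speech_stub)
print([(t.speaker, t.text) for t in dialogue])
```

Retrieval, tool use, and policy handling would all sit on the `text_model` side of this call, which is the point of the boundary.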

Trade-Offs

CSM's strongest property is structured speech modeling. The backbone sees a compact frame-level sequence, which makes long conversational context more plausible than a flattened codebook stream. The decoder then spends smaller local compute on acoustic detail, where long context adds less value.

The trade-off is that the system needs more orchestration than a single-stage TTS system. It has nested autoregression, codec dependencies, and a training scheme that samples decoder frames. My take: this is the right kind of complication. It complicates the serving path, but it removes a worse scaling problem from the backbone.

Architecture Diagram

The official Sesame diagram shows the inference process: backbone first, depth decoder second, then generated audio tokens feed back into the next frame.

[Figure: CSM inference architecture from Sesame, showing a backbone that predicts the first codebook and a depth decoder that fills in the remaining codebooks.]

Takeaways

CSM is useful because it makes the semantic/acoustic split explicit. The backbone models conversation at the frame level. The depth decoder handles the acoustic residue inside each frame. Mimi's RVQ tokens give the system a clean contract between those jobs. The main caveat is operational: nested autoregressive generation and codec coordination demand careful serving work. The main lesson travels well beyond speech: when one sequence contains two different compute regimes, do not make the largest model pay for both at the finest resolution.

References

  • Sesame, "Crossing the uncanny valley of voice"
  • Hugging Face model card, sesame/csm-1b
  • SesameAILabs CSM GitHub repository
  • Kyutai Mimi model card
  • Défossez et al., "Moshi: a speech-text foundation model for real-time dialogue"

author: Ope
tag: #speech
links: [[Full-Duplex Speech Models]], [[Multi-Token Prediction]], [[Small LLMs — Use Cases and Limits]]